Skip to content

snapshot: track completed sync terms for idempotent standby requests#461

Open
MrGuin wants to merge 1 commit intomainfrom
fix_barrier
Open

snapshot: track completed sync terms for idempotent standby requests#461
MrGuin wants to merge 1 commit intomainfrom
fix_barrier

Conversation

@MrGuin
Copy link
Collaborator

@MrGuin MrGuin commented Mar 17, 2026

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved snapshot synchronization to standby nodes with better handling of duplicate requests and enhanced retry logic for failed syncs.
    • Optimized state management to prevent incomplete sync operations from causing future failures.

@coderabbitai
Copy link

coderabbitai bot commented Mar 17, 2026

Walkthrough

This PR adds completion tracking per standby node and term to SnapshotManager via a new unordered_map-based data structure, introduces three helper methods to manage completion state (check, mark, erase), and integrates completion checks into snapshot sync registration and SyncWithStandby flow to prevent duplicate syncs and clean up stale data.

Changes

Cohort / File(s) Summary
Header declarations
tx_service/include/store/snapshot_manager.h
Added completed_snapshot_terms_ data member to track completed snapshot syncs per standby node/term, and three private method signatures for managing this state.
Implementation and integration
tx_service/src/store/snapshot_manager.cpp
Implemented completion tracking helpers; integrated completion checks into snapshot sync registration to skip duplicates; added pruning of stale completed terms; modified barrier registration and cleanup logic; updated SyncWithStandby to mark syncs as completed and manage pending state for retries; cleared completed terms on leader invalidation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • liunyl
  • lzxddz

Poem

🐰 A snapshot born, a sync complete,
No duplicates to skip a beat!
We mark and track each finishing term,
Stale data cleaned—a hard-earned turn,
Retries ready, barriers neat! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The pull request description is entirely empty, failing to provide any context, motivation, or explanation of the changes; additionally, none of the template checklist items are addressed. Add a comprehensive description explaining the purpose of tracking completed snapshot terms, why idempotency is important, and how this change addresses it; address all template checklist items including testing, documentation, and issue/RFC references.
Docstring Coverage ⚠️ Warning Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: tracking completed sync terms to support idempotent standby requests, which aligns with the substantive changes across both the header and implementation files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix_barrier
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can make CodeRabbit's review stricter and more nitpicky using the `assertive` profile, if that's what you prefer.

Change the reviews.profile setting to assertive to make CodeRabbit's nitpick more issues in your PRs.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tx_service/src/store/snapshot_manager.cpp (1)

102-117: ⚠️ Potential issue | 🟠 Major

Treat superseded standby terms as safe duplicates here too.

Line 105 and Line 123 only dedupe exact completed terms. Once the same node already has a newer barrier, pending task, or completed term, an out-of-order request for an older term still falls through to false here, even though RegisterSubscriptionBarrier() already ignores those older terms as stale. In a reordered RPC sequence, term T can start getting rejected after term T+1 is known, which breaks the idempotent/stale-request behavior this PR is aiming for. Consider returning true whenever the request term is older than the newest known barrier/pending/completed term for that node.

Also applies to: 120-135

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tx_service/src/store/snapshot_manager.cpp` around lines 102 - 117, When
handling an incoming snapshot sync request, treat any request whose term is
older than the newest known term for that standby as a safe duplicate: when
node_it == subscription_barrier_.end() first check
IsSnapshotSyncCompletedLocked(req->standby_node_id(), req->standby_node_term())
OR whether the node’s newest known term (from subscription_barrier_, any
pending-task structures, or the completed-term tracking used by
IsSnapshotSyncCompletedLocked) is greater than req->standby_node_term(); if so
return true instead of false. Likewise, when node_it !=
subscription_barrier_.end(), compare req->standby_node_term() against the stored
barrier/pending term (node_it->second or the pending task entry) and return true
if the request term is strictly less (stale). Apply the same
“older-than-newest-known-term => return true” logic to the other similar block
around lines 120-135 so superseded terms are treated as idempotent duplicates;
use subscription_barrier_, IsSnapshotSyncCompletedLocked, and the
pending/completed-term tracking already present to derive the newest known term.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@tx_service/src/store/snapshot_manager.cpp`:
- Around line 102-117: When handling an incoming snapshot sync request, treat
any request whose term is older than the newest known term for that standby as a
safe duplicate: when node_it == subscription_barrier_.end() first check
IsSnapshotSyncCompletedLocked(req->standby_node_id(), req->standby_node_term())
OR whether the node’s newest known term (from subscription_barrier_, any
pending-task structures, or the completed-term tracking used by
IsSnapshotSyncCompletedLocked) is greater than req->standby_node_term(); if so
return true instead of false. Likewise, when node_it !=
subscription_barrier_.end(), compare req->standby_node_term() against the stored
barrier/pending term (node_it->second or the pending task entry) and return true
if the request term is strictly less (stale). Apply the same
“older-than-newest-known-term => return true” logic to the other similar block
around lines 120-135 so superseded terms are treated as idempotent duplicates;
use subscription_barrier_, IsSnapshotSyncCompletedLocked, and the
pending/completed-term tracking already present to derive the newest known term.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6ee12fea-568a-4e7e-8713-2ce1b311fb0a

📥 Commits

Reviewing files that changed from the base of the PR and between a64a31d and 6c01aa1.

📒 Files selected for processing (2)
  • tx_service/include/store/snapshot_manager.h
  • tx_service/src/store/snapshot_manager.cpp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant